1 Preparatory steps for the Markdown file

Setting the options for knitr

library(knitr)
knitr::opts_chunk$set(results = 'asis',
                       echo = TRUE,
                       comment = NA,
                       prompt = FALSE,
                       warning = FALSE,
                       message = FALSE,
                       fig.align = "center",
                       fig.width = 8.125,
                       out.width = "100%",
                       fig.path = "Figures/figures_mlr_svm/",
                       dev = c('png', 'tiff'),
                       dev.args = list(png = list(type = "cairo"), tiff = list(compression = 'lzw')),
                       cache = TRUE,
                       cache.path = "D:/cache/cache_mlr_svm/",
                       autodep = TRUE)
options(width = 1000, scipen = 999, knitr.kable.NA = '')

2 Preparatory steps

We load the required libraries…

library(tidyverse) # We use tidyverse to work with tibble and rely on pipes for successive operations
library(mlr) # We use the package mlr for our ML pipeline (another well-known package is caret. mlr has actually been superseded by mlr3, but its object-oriented approach is a bit more complex to apprehend at first)
library(mlrMBO) # This package contains what is needed for a Bayesian model-based optimization of the hyper-parameters of the svm algorithm
library(splitstackshape) # This package contains the function 'stratified' which we use to create a training set and a test set
library(parallel) # This package contains the function 'detectCores' to detect how many cores are available
library(parallelMap) # This package allows us to parallelize the exploration of the value space for the hyper-parameters
library(kableExtra) # To display nice tables
library(uwot) # To compute (supervised) UMAP
library(plotly) # To display interactive figures

We clean the memory (just in case):

rm(list = ls(all.names = TRUE))

We set the random seed for reproducibility purposes:

set.seed(123)

We load the data:

# setwd("D:/") # Setting the working directory (when compiling a markdown file, the default folder is the one which contains this file)
filename <- "Datasets/dataset_processed.txt"
df <- read.table(filename, header = TRUE, sep = "\t", dec = ".", quote = "", stringsAsFactors = TRUE)
df <- df %>% as_tibble() # We convert the data frame as a tibble
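As a minimal, self-contained illustration of this read.table() call, here is the same pattern applied to a throwaway temporary file (the column names below are invented for the example; the real dataset's columns are not shown here):

```r
# Toy illustration of the read.table() call above, on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("individual\tduration", "Jill\t0.51", "Zuani\t0.73"), tmp)
toy <- read.table(tmp, header = TRUE, sep = "\t", dec = ".", quote = "", stringsAsFactors = TRUE)
str(toy) # 'individual' is read as a factor thanks to stringsAsFactors = TRUE
```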

3 Step-by-step process

3.1 Defining the predictors and predicted variable

In this demo, we are going to predict individual signatures with a small set of acoustic features consisting of duration, vocalization.HNR and 5 DCT coefficients.

We define our predicted variable, which corresponds to one of the columns of our data table ‘df’:

DV <- "individual"

We define our feature set / set of predictors, which correspond to several columns in our data table ‘df’:

DCT_set <- c("duration", "vocalization.HNR", "dct0", "dct1", "dct2", "dct3", "dct4")

We prepare a training set and a test set with a 80%-20% split. The proportions of individuals will be the same in both sets.

partition <- stratified(df, "individual", 0.8, bothSets = TRUE) 
training_set <- partition$SAMP1 # We extract the training set from the 'partition' object
test_set <- partition$SAMP2 # We extract the test set from the 'partition' object

We further need to define tasks that the learner must perform. One task relates to the training set, and another one to the test set. To prepare each task, we need to keep only the predictors we are interested in and the predicted value (individual):

training_task <- training_set %>% # We start from the tibble 'training_set'
  select(all_of(c(DCT_set, DV))) %>% # We select the columns for the predictors and the predicted variable (all_of is used to pass strings of characters to select())
  as.data.frame() %>% # The function makeClassifTask() expects a data frame (it will convert a tibble with a warning message, which we avoid with this line)
  makeClassifTask(data = ., target = DV) # We define the task and indicate which variable/column is the predicted variable

We do exactly the same for the test task:

test_task <- test_set %>%
  select(all_of(c(DCT_set, DV))) %>%
  as.data.frame() %>%
  makeClassifTask(data = ., target = DV)

3.2 Computing and displaying a preliminary supervised UMAP of the calls

Before unfolding our machine learning pipeline further, we can compute and display a supervised UMAP to visualize the result of the dimensionality reduction process when information is provided about the individuals emitting the calls.

We first extract the columns corresponding to the DCT set from df, and call the umap() function (considering 100 neighbors for each observation, 2 target dimensions, a minimum distance of 0.01 between points, and a Euclidean distance) to obtain coordinates for all the observations, before reinserting information about the emitter in the resulting data table:

umap_df <- df %>%
  select(all_of(DCT_set)) %>%
  umap(n_neighbors = 100, n_components = 2, metric = "euclidean", min_dist = 0.01, y = df %>% pull(individual)) %>%
  data.frame() %>%
  as_tibble() %>%
  mutate(individual = df %>% pull(individual))

We can then display the output of the UMAP:

umap_p <- umap_df %>% ggplot(aes(x = X1, y = X2, col = individual)) +
    geom_point(size = 1, alpha = 0.5) +
    scale_color_brewer(palette = "Paired") +
    guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1.0))) +
    ggtitle("Supervised UMAP of the calls represented by DCT coefficients")

umap_p

We can also use the function plot_ly() to display an interactive figure:

umap_df %>%
  plot_ly(x = ~X1, y = ~X2, color = ~individual,
          colors = "Paired", type = 'scatter', mode = 'markers',
          hoverinfo = "text",
          hovertext = paste("Type:", df %>% pull(type),
                            "<br>Individual:", df %>% pull(individual),
                                "<br>Sequence:",  df %>% pull(sequence))) %>%  
  layout(plot_bgcolor = "#ffffff",
         legend = list(title=list(text='Individual')), 
         xaxis = list(title = "X1"),  
         yaxis = list(title = "X2")) 

We can see that while the calls of some individuals are well separated from others, a number of clusters are close to each other with a few isolated calls, which suggests challenges ahead for our upcoming classifier.

3.3 Defining the training and test sets

We prepare a training set and a test set with an 80%-20% split. The proportions of individuals will be the same in both sets.

partition <- stratified(df, "individual", 0.8, bothSets = TRUE) 
training_set <- partition$SAMP1 # We extract the training set from the 'partition' object
test_set <- partition$SAMP2 # We extract the test set from the 'partition' object
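For intuition, the following base-R sketch mimics what an 80%-20% stratified split does (stratified() from splitstackshape handles this for us; the toy class counts below are invented):

```r
set.seed(1)
toy <- data.frame(individual = rep(c("A", "B"), c(50, 100)))
# Sample 80% of the rows within each class, so class proportions are preserved
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$individual),
                     function(i) sample(i, round(0.8 * length(i)))))
training_toy <- toy[idx, , drop = FALSE]
test_toy <- toy[-idx, , drop = FALSE]
table(training_toy$individual) # 40 A's and 80 B's: the same 1:2 ratio as in 'toy'
```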

We further need to define tasks that the learner must perform. One task relates to the training set, and another one to the test set. To prepare each task, we need to keep only the predictors we are interested in and the predicted value (individual).

training_task <- training_set %>% # We start from the tibble 'training_set'
  select(all_of(c(DCT_set, DV))) %>% # We select the columns for the predictors and the predicted variable (all_of is used to pass strings of characters to select())
  as.data.frame() %>% # The function makeClassifTask() expects a data frame (it will convert a tibble with a warning message, which we avoid with this line)
  makeClassifTask(data = ., target = DV) # We define the task and indicate which variable/column is the predicted variable

We do exactly the same for the test task.

test_task <- test_set %>%
  select(all_of(c(DCT_set, DV))) %>%
  as.data.frame() %>%
  makeClassifTask(data = ., target = DV)

3.4 Creating an svm learner

We create an svm learner:

learner_svm <- makeLearner("classif.svm", predict.type = "prob")

In order to account for the imbalance in the dataset, we create weights for the different individuals in the training set. For each individual to predict, her/his weight is inversely proportional to her/his number of calls:

weights <- training_set %>% pull(individual) %>% table()
(weights <- 1 / weights)

.
    Bolombo      Busira      Djanoa    Hortense        Jill     Kumbuka        Lina      Vifijo       Zamba       Zuani 
0.016393443 0.017543860 0.008620690 0.011235955 0.003448276 0.011494253 0.016393443 0.007812500 0.006493506 0.004854369 

We can divide the weights by the minimum value so that our ‘reference level’ is 1:

weights <- weights / min(weights)

We display the weights:

tibble(Individual = names(weights), Weight = round(weights, 3)) %>% kable(align = 'lr') %>% kable_classic() %>% kable_styling(full_width = F, bootstrap_options = c("striped"))
Individual Weight
Bolombo 4.754
Busira 5.088
Djanoa 2.500
Hortense 3.258
Jill 1.000
Kumbuka 3.333
Lina 4.754
Vifijo 2.266
Zamba 1.883
Zuani 1.408
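In isolation, this weighting scheme reduces to the following (toy call counts invented for the example):

```r
# Inverse-frequency class weights: rarer individuals get larger weights
counts <- c(Rare = 10, Common = 40)
w <- 1 / counts
w <- w / min(w) # rescale so the most frequent individual has weight 1
w # Rare = 4, Common = 1
```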

We are going to ‘add’ the weights to our learner:

weighted_learner_svm <- learner_svm %>% makeWeightedClassesWrapper(wcw.weight = weights)

3.5 Tuning the hyper-parameters

While we could pick some plausible values for the hyper-parameters of our learner ourselves, it is better to rely on hyper-parameter tuning.

We define possible values for our hyper-parameters, as informed by the literature on svm:

hp_set <- makeParamSet(
  makeDiscreteParam("kernel", values = c("linear", "polynomial", "radial")), # What type of kernel to use for svm
  makeNumericParam("cost", lower = -7, upper = 7, trafo = function(x) 2^x),  # What possible values for the cost parameter, with a power-2 transformation (i.e., we explore values between 2^-7 and 2^7) 
  makeIntegerParam("degree", lower = 1, upper = 4, requires = quote(kernel == "polynomial")), # If the kernel is polynomial, the possible values for the number of degrees, between 1 and 4
  makeNumericParam("gamma", lower = -13, upper = 5, trafo = function(x) 2^x, requires = quote(kernel == "radial")) # If the kernel is radial, the possible values for gamma, between 2^-13 and 2^5
)
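The trafo functions mean that the optimizer searches on a log2 scale: a value x drawn in, say, [-7, 7] is mapped to a cost of 2^x. A quick check:

```r
# The power-2 transformation used for 'cost' (and similarly for 'gamma')
trafo <- function(x) 2^x
trafo(c(-7, 0, 7)) # 0.0078125 1 128: cost effectively ranges over [2^-7, 2^7]
```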

We need to define the strategy to conduct hyper-parameter tuning. We first choose a 5-fold cross-validation repeated 5 times to compute the classification performance:

hpt_strategy <- makeResampleDesc("RepCV", reps = 5L, folds = 5L, stratify = TRUE)
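Stratification keeps the class proportions stable within each fold. A base-R sketch of one repetition makes this concrete (toy labels invented; mlr handles all of this internally):

```r
set.seed(1)
y <- rep(c("A", "B"), c(20, 30))
fold_id <- integer(length(y))
# Assign fold numbers within each class, so every fold sees the same class mix
for (g in split(seq_along(y), y)) {
  fold_id[g] <- sample(rep(1:5, length.out = length(g)))
}
table(y, fold_id) # every fold contains 4 A's and 6 B's
```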

We then need to specify how we want to explore the value space of the hyper-parameters. There are different options here - from a random search to a grid search to a model-based search. We choose the latter, which is the most complex but also the most effective: it is a step-by-step optimization process. More details can be found here: https://mlrmbo.mlr-org.com/.

iters <- 25 # The number of steps/iterations of the optimization process
nb_points <- 25 # The number of candidate points proposed at each iteration of the search

We define the ctrlMBO object which contains the information about how we want to conduct the search:

ctrlMBO <- makeMBOControl(propose.points = nb_points) # How many points do we want?
ctrlMBO <- setMBOControlTermination(ctrlMBO, iters = iters) # How many iterations do we want?
ctrlMBO <- setMBOControlInfill(ctrlMBO, crit = crit.cb) # How should infill points be rated, i.e. what is the infill criterion? We choose a confidence bound infill criterion, which is good for numerical and non-numerical variables, as well as for parallelization, see https://mlrmbo.mlr-org.com/articles/supplementary/adaptive_infill_criteria.html
ctrlMBO <- setMBOControlMultiPoint(ctrlMBO, method = "cb") # How do we deal with multiple infill points? We choose to optimize the confidence bound 'cb' criterion

We can now define the tune control object:

tuneCtrl <- makeTuneControlMBO(mbo.control = ctrlMBO)

We define the number of cores we want to use for parallelization (we check the number of available cores and keep two available for other tasks):

n_cores <- detectCores() - 2

We launch the tuning of the hyper-parameters with parallelization. This will take more or less time depending on the frequency of the processor and the number of cores:

parallelStartSocket(n_cores) # Starting parallelization
tuning_results <- tuneParams(weighted_learner_svm, # We specify the learner / classifier we want to use
                             task = training_task, # We provide the task to perform
                             resampling = hpt_strategy, # We provide the strategy to compute the classification performance
                             par.set = hp_set, # We provide the various hyper-parameters to tune with their ranges of possible values
                             control = tuneCtrl, # We provide the tune control object
                             measures = logloss) # We finally provide the metric we want to use. We choose logloss here
parallelStop() # Ending parallelization
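The log loss we minimize here is the mean negative log-probability assigned to the true class. A minimal hand-rolled version (for illustration only, not mlr's implementation) makes the metric concrete:

```r
# Hand-rolled multiclass log loss on a toy probability matrix
logloss_toy <- function(prob, truth) {
  eps <- 1e-15 # clip probabilities to avoid log(0)
  p <- pmin(pmax(prob[cbind(seq_along(truth), truth)], eps), 1 - eps)
  -mean(log(p))
}
prob <- matrix(c(0.7, 0.3,
                 0.2, 0.8), nrow = 2, byrow = TRUE)
logloss_toy(prob, truth = c(1, 2)) # -(log(0.7) + log(0.8)) / 2, about 0.290
```

The more confidently the classifier assigns probability to the correct class, the closer the log loss gets to 0.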

We can extract a table which contains information about the optimization process (we only display the first 100 rows):

tuning_table <- tuning_results$mbo.result$opt.path %>% # The data we want to access are hidden in the object 'tuning_results'
  trafoOptPath() %>% # We extract the optimization trajectory
  as.data.frame() %>% # We create a data frame with the results we want
  as_tibble() %>%
  rename(logloss = y, iteration = dob) %>%
  select(kernel, cost, degree, gamma, logloss, iteration, prop.type)

tuning_table %>%
  slice(1:100) %>% # We only display the 100 first rows
  kable(align = 'lrrrrll') %>%
  kable_classic() %>%
  kable_styling(full_width = F, bootstrap_options = c("striped"))
kernel       cost         degree  gamma         logloss   iteration  prop.type
linear       -2.2319932   NA      NA            1.715623  0          initdesign
radial        5.2621098   NA      -7.5332109    1.630867  0          initdesign
linear        2.6187802   NA      NA            1.721313  0          initdesign
polynomial    0.4443952   1       NA            1.715013  0          initdesign
radial       -0.9066463   NA      -2.9805976    1.577050  0          initdesign
linear       -5.5887667   NA      NA            1.723540  0          initdesign
polynomial   -2.6618218   4       NA            1.903838  0          initdesign
polynomial    2.9609712   2       NA            1.797366  0          initdesign
linear       -4.1680640   NA      NA            1.716024  0          initdesign
polynomial    6.7983543   2       NA            1.836778  0          initdesign
radial       -0.7750118   NA       3.7032424    2.106134  0          initdesign
radial       -4.6077879   NA      -6.7591675    2.001576  0          initdesign
linear        1.1785871   NA      NA            1.718578  0          initdesign
radial        4.6607154   NA     -10.1725355    1.709094  0          initdesign
polynomial   -6.5644155   2       NA            2.007421  0          initdesign
linear        4.0449170   NA      NA            1.721126  0          initdesign
radial       -1.1802174   NA      -1.5036083    1.582713  1          infill_cb
linear        5.0920968   NA      NA            1.722183  1          infill_cb
radial        5.5699663   NA      -8.5741084    1.665264  1          infill_cb
radial       -1.4151071   NA      -1.2036125    1.589656  1          infill_cb
radial       -1.2761735   NA      -0.5876800    1.613024  1          infill_cb
radial        5.0664415   NA      -8.8153958    1.678976  1          infill_cb
radial       -1.4341483   NA       0.3504361    1.676677  1          infill_cb
radial       -1.0438904   NA      -0.0298283    1.647517  1          infill_cb
radial       -1.2313317   NA      -1.3169600    1.586680  1          infill_cb
radial       -1.0111118   NA      -1.3556128    1.585571  1          infill_cb
radial       -1.1538471   NA      -1.0581460    1.593587  1          infill_cb
radial       -0.9501010   NA      -1.0279137    1.596225  1          infill_cb
linear        5.0940710   NA      NA            1.721520  1          infill_cb
radial        5.0873455   NA      -8.6565580    1.674572  1          infill_cb
radial        5.1899080   NA      -8.7742261    1.675734  1          infill_cb
radial       -1.3695898   NA      -1.4547744    1.583167  1          infill_cb
radial       -1.3234391   NA      -0.4367170    1.620767  1          infill_cb
radial       -0.8733179   NA      -3.1266769    1.578697  1          infill_cb
radial       -0.8737498   NA      -0.7800099    1.607396  1          infill_cb
radial       -0.8678430   NA      -3.3537036    1.583869  1          infill_cb
radial        5.1382049   NA      -8.5570591    1.670384  1          infill_cb
radial        5.6653647   NA      -8.2485446    1.653085  1          infill_cb
radial        5.1101553   NA      -8.6918808    1.676334  1          infill_cb
radial       -0.9029321   NA      -1.0710657    1.596111  1          infill_cb
radial        5.7186648   NA      -8.7949686    1.670351  1          infill_cb
radial       -0.8726805   NA      -3.1074984    1.578075  2          infill_cb
radial       -0.3674265   NA       3.7604844    2.091175  2          infill_cb
radial       -0.8159256   NA       1.9507666    1.860921  2          infill_cb
radial       -0.8714972   NA      -2.6163160    1.572439  2          infill_cb
radial       -0.8642606   NA      -4.9375526    1.650150  2          infill_cb
radial       -0.8217960   NA      -1.6843194    1.580240  2          infill_cb
radial       -0.8707976   NA       2.0051230    1.869590  2          infill_cb
radial       -0.8707483   NA      -3.5962151    1.590767  2          infill_cb
radial       -0.8707766   NA      -2.8363504    1.573618  2          infill_cb
radial       -0.8583843   NA      -3.5884170    1.589698  2          infill_cb
radial       -0.8628111   NA      -2.8500774    1.575444  2          infill_cb
radial       -0.8224791   NA      -1.2788854    1.590219  2          infill_cb
radial       -0.8484318   NA       2.0239444    1.874146  2          infill_cb
radial       -0.8493955   NA      -3.1678356    1.579087  2          infill_cb
radial       -0.8710940   NA      -2.3343507    1.573237  2          infill_cb
radial       -0.8706594   NA      -3.2217565    1.580171  2          infill_cb
radial       -0.2129980   NA       1.9113819    1.837900  2          infill_cb
radial       -0.8216067   NA      -1.5272592    1.583021  2          infill_cb
radial       -0.4730119   NA       4.8857844    2.157431  2          infill_cb
radial       -0.6270161   NA       4.2879127    2.138969  2          infill_cb
radial       -0.5180113   NA       2.9900039    1.997994  2          infill_cb
radial       -0.2823554   NA       4.3826768    2.140999  2          infill_cb
radial       -0.8707829   NA      -3.2168079    1.580670  2          infill_cb
radial       -0.8688984   NA      -3.2220686    1.580011  2          infill_cb
radial       -0.8620165   NA      -2.5763979    1.572601  2          infill_cb
radial       -0.8679508   NA      -2.4398315    1.572679  3          infill_cb
radial       -1.1528620   NA       1.0728355    1.747046  3          infill_cb
radial       -4.6564082   NA      -5.9140116    1.930245  3          infill_cb
radial       -0.8716877   NA      -2.6093658    1.573868  3          infill_cb
radial       -0.8593957   NA       1.0024164    1.735200  3          infill_cb
radial       -0.8552157   NA      -1.0252442    1.598435  3          infill_cb
radial       -0.8504703   NA       1.0310663    1.738133  3          infill_cb
radial       -0.8626720   NA      -3.0435471    1.577394  3          infill_cb
radial       -0.8723369   NA      -2.6446009    1.573672  3          infill_cb
radial       -0.8725902   NA      -2.3769305    1.572655  3          infill_cb
radial       -0.8695280   NA       1.0357006    1.737587  3          infill_cb
radial       -0.8727774   NA      -1.8282629    1.577753  3          infill_cb
radial       -0.8700257   NA      -2.2889321    1.572671  3          infill_cb
radial       -0.8719769   NA      -2.0592013    1.575379  3          infill_cb
radial       -1.1963530   NA       0.9929667    1.739587  3          infill_cb
radial       -0.8548052   NA       1.0243863    1.735823  3          infill_cb
radial       -0.8687112   NA       0.9901102    1.734606  3          infill_cb
radial       -0.8873200   NA      -2.4053089    1.572307  3          infill_cb
radial       -0.8718358   NA      -2.7449282    1.574155  3          infill_cb
radial       -0.8710987   NA      -2.3619598    1.572911  3          infill_cb
radial       -0.8687766   NA       1.1194980    1.747854  3          infill_cb
radial       -0.8719155   NA      -2.1592950    1.574341  3          infill_cb
radial       -0.8583405   NA       1.0787913    1.741634  3          infill_cb
radial       -0.8604120   NA      -2.7128998    1.573055  3          infill_cb
radial       -0.9139719   NA       1.0701969    1.741476  3          infill_cb
radial       -0.8731555   NA      -1.9474817    1.576163  4          infill_cb
radial       -0.8120269   NA      -2.8529733    1.573550  4          infill_cb
radial       -0.8729481   NA      -1.9264518    1.577438  4          infill_cb
radial       -0.8021972   NA      -3.0526330    1.576690  4          infill_cb
radial       -0.8102226   NA      -2.3971541    1.572874  4          infill_cb
radial       -0.8712880   NA      -2.4190203    1.572766  4          infill_cb
radial       -0.8711560   NA      -2.4437040    1.572401  4          infill_cb
radial       -0.8212021   NA      -2.4908515    1.573242  4          infill_cb
radial       -0.8722075   NA      -1.8072620    1.578168  4          infill_cb

We can plot how we gradually improve the log loss through the tuning process:

tuning_table %>%
  group_by(iteration) %>%
  summarize(min_logloss = min(logloss)) %>%
  ungroup() %>%
  mutate(min_logloss = cummin(min_logloss)) %>%
  ggplot(aes(x = iteration, y = min_logloss)) +
  geom_line() +
  geom_point() +
  xlab("Iteration step") +
  ylab("logloss")
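The cummin() call keeps track of the best (lowest) log loss reached so far, which is why the curve is monotonically non-increasing. A minimal illustration:

```r
# The running minimum never goes back up once a better value has been found
cummin(c(3, 2, 4, 1, 5)) # 3 2 2 1 1
```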

We extract the best hyper-parameters and the corresponding score:

(best_params <- tuning_results$x)

$kernel
[1] "radial"

$cost
[1] 0.8495988

$gamma
[1] 0.1453802

(best_score <- tuning_results$y)

logloss.test.mean 
          1.56906 

3.6 Using the model with the best values for the hyper-parameters

We can now set our svm learner with the optimal values of the hyper-parameters according to the data of the training set:

weighted_learner_svm <- weighted_learner_svm %>% setHyperPars(par.vals = best_params)

We train our learner with the training_task:

trained_learner <- weighted_learner_svm %>% train(training_task)

We can now use the trained learner to perform the test task:

predictions <- trained_learner %>% predict(test_task)

We can compute the performance with three metrics - logloss, accuracy, and balanced accuracy:

pred <- performance(pred = predictions, measures = list(logloss, acc, bac))

data.frame(Metric = names(pred), Value = unname(pred)) %>% kable(align="c") %>%
  kable_classic() %>%
  kable_styling(full_width = F)
Metric Value
logloss 1.6180860
acc 0.4726688
bac 0.3699504
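Balanced accuracy is the mean of the per-class recalls, so each individual counts equally regardless of how many calls she/he has. A toy computation (labels invented for the example):

```r
truth <- c("A", "A", "A", "A", "B")
pred  <- c("A", "A", "B", "B", "B")
acc <- mean(pred == truth) # plain accuracy: 3/5 = 0.6
# Per-class recall, then average: (2/4 + 1/1) / 2 = 0.75
recalls <- sapply(unique(truth), function(cl) mean(pred[truth == cl] == cl))
bac <- mean(recalls)
```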

We can further display a confusion matrix:

calculateConfusionMatrix(pred = predictions, sums = TRUE, set = "both")$result %>%
  kable(align="c") %>%
  kable_styling(full_width = T)
          Bolombo  Busira  Djanoa  Hortense  Jill  Kumbuka  Lina  Vifijo  Zamba  Zuani  -err.-   -n-
Bolombo         0       0       2         1     3        0     1       0      3      5      15    15
Busira          1       0       0         1     2        1     0       6      2      1      14    14
Djanoa          0       0      11         0     9        2     0       2      0      5      18    29
Hortense        0       0       0         9     5        1     0       1      3      3      13    22
Jill            0       1       0         1    47        6     2       3      6      6      25    72
Kumbuka         1       0       0         2     5        5     2       3      1      3      17    22
Lina            0       0       1         0     3        0     7       1      0      3       8    15
Vifijo          0       0       2         4    10        0     0      10      4      2      22    32
Zamba           1       0       0         3    13        2     0       0     19      1      20    39
Zuani           0       0       4         0     4        1     0       1      2     39      12    51
-err.-          3       1       9        12    54       13     5      17     21     29     164
-n-             3       1      20        21   101       18    12      27     40     68         1249

  1. Département des arts, des lettres et du langage, Université du Québec à Chicoutimi, Chicoutimi, Canada↩︎

  2. Université de Lyon, CNRS, Laboratoire Dynamique Du Langage, UMR 5596, Lyon, France↩︎

  3. ENES Bioacoustics Research Laboratory, University of Saint Étienne, CRNL, CNRS UMR 5292, INSERM UMR_S 1028, Saint-Étienne, France↩︎

  4. Département des arts, des lettres et du langage, Université du Québec à Chicoutimi, Chicoutimi, Canada↩︎

  5. ENES Bioacoustics Research Laboratory, University of Saint Étienne, CRNL, CNRS UMR 5292, INSERM UMR_S 1028, Saint-Étienne, France↩︎

  6. ENES Bioacoustics Research Laboratory, University of Saint Étienne, CRNL, CNRS UMR 5292, INSERM UMR_S 1028, Saint-Étienne, France↩︎

  7. Department of Linguistics, The University of Hong Kong, Hong Kong, China, Corresponding author↩︎